Two Stage Max Gain Content Defined Chunking for De-duplication

Author

  • Arul Selvan

Abstract

Data de-duplication is a simple concept backed by sophisticated technology. De-duplication systems reduce storage consumption by identifying chunks of data with identical content and storing each distinct chunk only once, along with metadata describing how to reconstruct the original files from the chunks; this takes far less space than storing every copy in full. Most de-duplication programs are based on fingerprinting: the data stream is segmented into chunks, the chunks are written to disk, and a signature, much like a fingerprint, is computed for each data segment. An index over these fingerprints is maintained, and it can be rebuilt from the stored segments. When previously seen data appears again in the stream, the de-duplication algorithm inserts a metadata pointer to the block already on disk; if the same block occurs more than once, multiple pointers reference the single stored copy. This eliminates redundant blocks in the data stream and improves storage space efficiency. Content-defined chunking identifies chunk boundaries from the internal properties of the data stream itself, and many algorithms are in use to identify repeated content. This paper presents a new two-stage content-based chunking algorithm: the first stage can be any content-based algorithm that selects cut points, and the second stage applies the newly developed max-gain algorithm to identify new cut points within the repeated data.
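The fingerprint-index scheme described in the abstract can be sketched as follows. This is an illustrative toy, not the paper's max-gain algorithm: the shift-add boundary hash, the chunk-size limits, and the use of SHA-256 as the fingerprint are all assumptions made for the example.

```python
import hashlib

# Toy sketch of fingerprint-based de-duplication: chunks are cut where a
# simple content hash matches a bit pattern, each distinct chunk is stored
# once, and each file becomes a "recipe" of fingerprints.

MASK = 0x0FFF                      # cut when (h & MASK) == 0, ~4 KiB average
MIN_CHUNK, MAX_CHUNK = 512, 16384  # assumed chunk-size limits

def chunks(data: bytes):
    """Yield content-defined chunks of `data`."""
    start, h = 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + b) & 0xFFFFFFFF   # toy hash over the chunk so far
        size = i - start + 1
        if (size >= MIN_CHUNK and (h & MASK) == 0) or size >= MAX_CHUNK:
            yield data[start:i + 1]
            start, h = i + 1, 0
    if start < len(data):
        yield data[start:]                # flush the final partial chunk

class DedupStore:
    """Stores each distinct chunk once; a file is a list of fingerprints."""
    def __init__(self):
        self.store = {}    # fingerprint -> chunk bytes (one copy per chunk)
        self.recipes = {}  # filename -> ordered list of fingerprints

    def put(self, name: str, data: bytes):
        recipe = []
        for c in chunks(data):
            fp = hashlib.sha256(c).hexdigest()  # the chunk's "fingerprint"
            self.store.setdefault(fp, c)        # repeats cost a pointer only
            recipe.append(fp)
        self.recipes[name] = recipe

    def get(self, name: str) -> bytes:
        """Reconstruct a file from its recipe of fingerprints."""
        return b"".join(self.store[fp] for fp in self.recipes[name])
```

Because chunk boundaries depend only on content, two files sharing a long common prefix produce identical chunks for that region and the duplicates are stored only once. Real systems use a true sliding-window rolling hash (e.g. Rabin fingerprints) rather than this toy hash.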


Similar Articles

Survey of Research on Chunking Techniques

The explosive growth of data produced by different devices and applications has contributed to the abundance of big data. To process such amounts of data efficiently, strategies such as de-duplication have been employed. Among the three levels of de-duplication, namely file level, block level, and chunk level, de-duplication at the chunk level, also known as byte level, is the most popular a...


Implementation of a File System with Encryption and De-duplication

With the rapid advance of society, especially the development of computer technology, network technology and information technology, there is an increasing demand for systems that can provide secure data storage in a cost-effective manner. In this paper, we propose a prototype file system called EDFS (Encryption and De-duplication File System), which provides both data security and space effici...


Application for Data De-duplication Algorithm Based on Mobile Devices

With the rapid development of mobile devices, coupled with the rise of personal cloud services, the business of cloud storage and cloud synchronization has grown rapidly in recent years. As a result, it puts pressure on network storage space and network bandwidth, especially in the field of the mobile Internet. Data de-duplication algorithms can reduce the data redundancy in the...


Leap-based Content Defined Chunking - Theory and Implementation

Content Defined Chunking (CDC) is an important component in data deduplication, which affects both the deduplication ratio and deduplication performance. The sliding-window-based CDC algorithm and its variants have been the most popular CDC algorithms for the last 15 years. However, their performance is limited in certain application scenarios since they have to slide byte by byte. The a...
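The byte-by-byte cost this snippet refers to can be seen in a minimal sliding-window rolling hash in the Rabin-fingerprint style. This is a simplified sketch, not the algorithm from the cited paper; the base, modulus, window size, and mask are all assumed parameters. The hash of the window ending at byte i is derived in O(1) from the previous window, but the boundary test still runs at every byte position, which is exactly the per-byte work that leap-based CDC tries to skip.

```python
# Simplified polynomial rolling hash over a fixed-size sliding window
# (Rabin-style; all parameters below are illustrative assumptions).

BASE = 257
MOD = (1 << 31) - 1
W = 48  # sliding-window size in bytes (assumed)

def cut_points(data: bytes, mask: int = 0xFF):
    """Return offsets where the window hash matches the mask pattern."""
    if len(data) < W:
        return []
    pw = pow(BASE, W - 1, MOD)   # precomputed BASE^(W-1) for O(1) removal
    h = 0
    for b in data[:W]:           # hash of the first window data[0:W]
        h = (h * BASE + b) % MOD
    cuts = []
    for i in range(W, len(data)):
        if (h & mask) == mask:   # boundary test at every byte position
            cuts.append(i)
        # slide the window: drop data[i-W], append data[i], in O(1)
        h = ((h - data[i - W] * pw) * BASE + data[i]) % MOD
    return cuts
```

The O(1) update keeps the total cost linear, but the loop still visits every byte; leap-based CDC instead jumps ahead when the window content rules out a nearby boundary.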


Accelerating Data Deduplication by Exploiting Pipelining and Parallelism with Multicore or Manycore Processors

As the amount of digital data grows explosively, data deduplication has gained increasing attention for its space-efficient functionality, which not only reduces the storage space requirement by eliminating duplicate data but also minimizes the transmission of redundant data in data-intensive storage systems. Most existing state-of-the-art deduplication methods remove redundant data at either ...



Journal title:

Volume   Issue

Pages  -

Publication date: 2012